Use case example

This page shows one complete workflow from login to production execution.

Prerequisites

Before starting, make sure you can reach the DGX over SSH with your account (step 1 shows the hostname).

Goal

Run train.py, first in an interactive GPU session, then in batch mode on the prod10 partition.
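Here, train.py stands for your own training script. Purely as a placeholder (this file is not provided by the cluster), a minimal stand-in using only numpy, installed in step 2, could look like:

```python
# Hypothetical placeholder for train.py: fits a linear model with
# least squares as a stand-in for a real training loop.
import numpy as np

def fit(X, y):
    """Return the least-squares weight vector w for y ≈ X @ w."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def main():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))        # synthetic inputs
    w_true = np.array([1.0, -2.0, 0.5])  # ground-truth weights
    y = X @ w_true                       # noiseless targets
    w = fit(X, y)
    print("learned weights:", np.round(w, 3))

if __name__ == "__main__":
    main()
```

Any script that runs on the command line works the same way in the steps below.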

1) Connect and create a project folder

ssh dgx
# if you do not use an SSH alias:
# ssh <username>@hubia-dgx.centralesupelec.fr
mkdir -p ~/my_project
cd ~/my_project

2) Create a Python virtual environment

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install numpy torch

Quick check:

python -c "import numpy; print(numpy.__version__)"

3) Test on GPU with an interactive session

srun -p interactive10 --time=00:30:00 --pty bash

Inside the interactive shell:

cd ~/my_project
source venv/bin/activate
python train.py
exit

4) Prepare a batch script

Use the default template:

cd ~/my_project
cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch

Set at least:

  • the job name (#SBATCH --job-name=...)
  • the time limit (#SBATCH --time=...)
  • the Python command to run (python train.py)
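As a sketch only, a minimal job.sbatch could look like the following; the partition name, time limit, and output pattern are illustrative assumptions, and the real template on the DGX may already set these plus additional options such as GPU requests:

```shell
#!/bin/bash
#SBATCH --job-name=my_training   # name shown in squeue
#SBATCH --partition=prod10       # partition used in this workflow (assumption)
#SBATCH --time=02:00:00          # walltime limit (assumption: 2 h)
#SBATCH --output=slurm-%j.out    # log file, %j expands to the job id

cd ~/my_project
source venv/bin/activate
python train.py
```

Prefer editing the cluster's own template over writing one from scratch, since it encodes site-specific defaults.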

5) Submit and monitor

sbatch job.sbatch
squeue -u $USER

Inspect one job:

scontrol show job <jobid>
sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode

Read logs:

tail -n 100 slurm-<jobid>.out

Cancel if needed:

scancel <jobid>
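The monitoring commands above can be wrapped in a small helper that polls squeue until the job leaves the queue; the function name and the 30-second interval are arbitrary choices, not cluster conventions:

```shell
# Hypothetical helper: block until a Slurm job is no longer queued or running.
wait_for_job() {
  local jobid="$1"
  # squeue -h prints nothing once the job has left the queue
  while squeue -j "$jobid" -h 2>/dev/null | grep -q .; do
    sleep 30
  done
  echo "job $jobid finished"
}

# usage: wait_for_job <jobid> && tail -n 100 slurm-<jobid>.out
```

This is convenient for chaining a log inspection right after job completion.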

6) Scale up only when needed

If the model does not fit in prod10 (10 GB of VRAM), move to a larger partition:

  • prod40 (40 GB VRAM)
  • prod80 (80 GB VRAM)

Keep the same workflow; only the partition, time limit, and script content change.

7) Optional: work from VS Code

You can use VS Code Remote-SSH to edit files on the DGX.

If the extension gets stuck, try:

  • run "Remote-SSH: Uninstall VS Code Server from Host" from the Command Palette
  • reconnect to the host

Next references